
    Character-Level Incremental Speech Recognition with Recurrent Neural Networks

    Full text link
    In real-time speech recognition applications, latency is an important issue. We have developed a character-level incremental speech recognition (ISR) system that responds quickly even while the user is still speaking, with hypotheses that are gradually refined as the speech proceeds. The algorithm employs a speech-to-character unidirectional recurrent neural network (RNN), which is trained end-to-end with connectionist temporal classification (CTC), and an RNN-based character-level language model (LM). The outputs of the CTC-trained RNN are character-level probabilities, which are processed by beam search decoding. The RNN LM augments the decoding by providing long-term dependency information. We propose a tree-based online beam search with additional depth-pruning, which enables the system to process infinitely long input speech with low latency. The system not only responds quickly to speech but can also dictate out-of-vocabulary (OOV) words according to their pronunciation. The proposed model achieves a word error rate (WER) of 8.90% on the Wall Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284 training set.
    Comment: To appear in ICASSP 2016.
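The beam search step mentioned in this abstract can be illustrated with a toy prefix beam search over per-frame CTC character probabilities. This is only a minimal sketch: it omits the RNN language model, the prefix tree, the online operation, and the depth-pruning that the abstract describes, and the function name, alphabet, and probability matrix below are made up for illustration.

```python
# Minimal CTC prefix beam search sketch (no language model, probabilities kept in
# linear space for brevity). `probs` is a (T, V) matrix of per-frame character
# probabilities from a CTC-trained RNN; index 0 is assumed to be the CTC blank.
from collections import defaultdict

def ctc_beam_search(probs, alphabet, beam_width=8, blank=0):
    # Each beam entry maps a prefix (tuple of character indices) to a pair
    # (p_blank, p_non_blank): probability of that prefix ending in blank / non-blank.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            # Case 1: emit blank -> prefix unchanged, now ends in blank.
            next_beams[prefix][0] += frame[blank] * (p_b + p_nb)
            # Case 2: repeat the last character -> prefix unchanged (CTC collapse).
            if prefix:
                next_beams[prefix][1] += frame[prefix[-1]] * p_nb
            # Case 3: extend the prefix with a new character.
            for c in range(len(frame)):
                if c == blank:
                    continue
                new_prefix = prefix + (c,)
                if prefix and c == prefix[-1]:
                    # Same character again counts as a new emission only after a blank.
                    next_beams[new_prefix][1] += frame[c] * p_b
                else:
                    next_beams[new_prefix][1] += frame[c] * (p_b + p_nb)
        # Keep only the `beam_width` most probable prefixes (pruning step).
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
    return "".join(alphabet[i] for i in best)

# Tiny usage example with a 3-symbol alphabet: blank, 'a', 'b'.
if __name__ == "__main__":
    import numpy as np
    probs = np.array([[0.1, 0.8, 0.1],
                      [0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])
    print(ctc_beam_search(probs, alphabet={0: "", 1: "a", 2: "b"}))  # -> "ab"
```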

    Single stream parallelization of generalized LSTM-like RNNs on a GPU

    Full text link
    Recurrent neural networks (RNNs) have shown outstanding performance on sequence data. However, they suffer from long training times, which demands parallel implementations of the training procedure. Parallelizing the training algorithms of RNNs is very challenging because the internal recurrent paths create dependencies between different time frames. In this paper, we first propose a generalized graph-based RNN structure that covers the most popular long short-term memory (LSTM) network. Then, we present a parallelization approach that automatically exploits the parallelism of arbitrary RNNs by analyzing the graph structure. The experimental results show that the proposed approach achieves a substantial speed-up even with a single training stream, and accelerates training further when combined with multiple parallel training streams.
    Comment: Accepted by the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
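The intra-time-step parallelism that such a graph analysis can expose is easy to picture with a toy example. The sketch below only groups the nodes of one made-up, LSTM-like time step into dependency levels that could be launched concurrently; it is not the paper's framework, which handles arbitrary RNN graphs and additionally uses multiple GPU training streams.

```python
# Toy sketch: group the nodes of a per-timestep RNN computation graph into
# dependency levels, so that all nodes in one level can be launched together
# (e.g., as concurrent GPU kernels). Node names below are illustrative only.
from collections import deque

def dependency_levels(deps):
    """deps maps node -> set of nodes it depends on (within one time step)."""
    indegree = {n: len(d) for n, d in deps.items()}
    dependents = {n: [] for n in deps}
    for n, d in deps.items():
        for p in d:
            dependents[p].append(n)
    levels, ready = [], deque(n for n, k in indegree.items() if k == 0)
    while ready:
        level = list(ready)           # all currently ready nodes are independent
        ready.clear()
        levels.append(level)
        for n in level:
            for m in dependents[n]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    ready.append(m)
    return levels

# Gate-level view of one LSTM step: the four gate computations are independent,
# the cell update needs the gates, and the output needs the cell state.
lstm_step = {
    "input_gate": set(), "forget_gate": set(), "cell_candidate": set(), "output_gate": set(),
    "cell_state": {"input_gate", "forget_gate", "cell_candidate"},
    "hidden_state": {"output_gate", "cell_state"},
}
for i, level in enumerate(dependency_levels(lstm_step)):
    print(f"level {i}: run in parallel -> {sorted(level)}")
```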

    Fixed-Point Performance Analysis of Recurrent Neural Networks

    Full text link
    Recurrent neural networks have shown excellent performance in many applications; however, they demand considerable complexity in hardware- or software-based implementations. The hardware complexity can be lowered substantially by minimizing the word length of the weights and signals. This work analyzes the fixed-point performance of recurrent neural networks using a retrain-based quantization method. The quantization sensitivity of each layer in RNNs is studied, and overall fixed-point optimization results that minimize the capacity of the weights without sacrificing performance are presented. Language modeling and phoneme recognition examples are used for evaluation.
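A minimal sketch of the retrain-based quantization idea, under assumptions of my own (uniform symmetric quantization, a toy linear-regression task, and a straight-through-style update of the full-precision weights): the forward pass uses quantized weights while the gradient updates accumulate in a full-precision copy, which is what lets the model recover accuracy at short word lengths.

```python
# Retrain-based weight quantization sketch (illustrative assumptions, not the
# paper's exact recipe): quantize weights to a given word length in the forward
# pass, keep updating the full-precision "master" weights during retraining.
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of `w` to `bits` bits (one sign bit)."""
    levels = 2 ** (bits - 1) - 1            # e.g. 6 bits -> 31 positive levels
    step = np.max(np.abs(w)) / levels if np.any(w) else 1.0
    return np.clip(np.round(w / step), -levels, levels) * step

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
true_w = rng.normal(size=8)
y = x @ true_w

w = rng.normal(size=8) * 0.1                 # full-precision master weights
lr = 0.05
for _ in range(200):                         # retraining loop
    wq = quantize(w, bits=4)                 # forward pass uses quantized weights
    err = x @ wq - y
    grad = x.T @ err / len(x)
    w -= lr * grad                           # update the full-precision copy
print("loss with 4-bit weights after retraining:",
      float(np.mean((x @ quantize(w, 4) - y) ** 2)))
```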

    Online Speech Recognition with Recurrent Neural Networks

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung.
    Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance.
    Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn a direct mapping from a sequence of audio features to a sequence of output characters or words without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising its accuracy to the level of traditional state-of-the-art models. However, even though end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level speech recognition with pre-segmented audio rather than online speech recognition with continuous audio. This is because RNNs trained on segmented audio do not generalize easily to very long audio streams. To address this problem, we propose an RNN training approach for training sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, where the original CTC loss is estimated over a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training. In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model that is augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm that prevents exponential growth of the tree. The model is free of phoneme and lexicon models and can decode infinitely long audio sequences. It also has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy. Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM shows improved perplexity compared to a lightweight word-level RNN LM of comparable size.
    When this RNN LM is applied to the proposed character-level online ASR, speech recognition accuracy improves while the amount of computation is reduced (a toy sketch of the truncated BPTT training follows the contents below).
    Contents:
    1 Introduction
      1.1 Automatic Speech Recognition
        1.1.1 Traditional ASR
        1.1.2 End-to-End ASR with Recurrent Neural Networks
        1.1.3 Offline and Online ASR
      1.2 Scope of the Dissertation
        1.2.1 End-to-End Online ASR with RNNs
        1.2.2 Challenges and Contributions
    2 Flexible and Efficient RNN Training on GPUs
      2.1 Introduction
      2.2 Generalization
        2.2.1 Generalized RNN Structure
        2.2.2 Training
      2.3 Parallelization
        2.3.1 Intra-Stream Parallelism
        2.3.2 Inter-Stream Parallelism
      2.4 Experiments
      2.5 Concluding Remarks
    3 Online Sequence Training with Connectionist Temporal Classification
      3.1 Introduction
      3.2 Connectionist Temporal Classification
      3.3 Online Sequence Training
        3.3.1 Problem Definition
        3.3.2 Overview of the Proposed Approach
        3.3.3 CTC-TR: Standard CTC with Truncation
        3.3.4 CTC-EM: EM-Based Online CTC
      3.4 Training Continuously Running RNNs
      3.5 Parallel Training
      3.6 Experiments
        3.6.1 End-to-End Speech Recognition with RNNs
        3.6.2 Phoneme Recognition on TIMIT
      3.7 Concluding Remarks
    4 Character-Level Incremental Speech Recognition
      4.1 Introduction
      4.2 Models
        4.2.1 Acoustic Model
        4.2.2 Language Model
      4.3 Character-Level Beam Search
        4.3.1 Prefix-Tree-Based CTC Beam Search
        4.3.2 Pruning
      4.4 Experiments
      4.5 Concluding Remarks
    5 Character-Level Language Modeling with Hierarchical RNNs
      5.1 Introduction
      5.2 Related Work
        5.2.1 Character-Level Language Modeling with RNNs
        5.2.2 Character-Aware Word-Level Language Modeling
      5.3 RNNs with External Clock and Reset Signals
      5.4 Character-Level Language Modeling with a Hierarchical RNN
      5.5 Experiments
        5.5.1 Perplexity
        5.5.2 End-to-End Automatic Speech Recognition (ASR)
      5.6 Concluding Remarks
    6 Conclusion
    Bibliography
    Abstract in Korean
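The training scheme at the core of the thesis abstract above, truncated BPTT with hidden-state carry-over across segments of a continuous stream, can be sketched with a toy numpy RNN. Everything task-specific below (the sine-wave stream, sizes, learning rate) is an illustrative assumption; the point is only that the forward state flows across segment boundaries while gradients are truncated at them.

```python
# Toy truncated BPTT on a continuous stream: consume the stream in fixed-length
# segments, carry the hidden state across segment boundaries, and backpropagate
# only within each segment.
import numpy as np

rng = np.random.default_rng(0)
H, SEG, LR = 16, 32, 0.05
Wx = rng.normal(scale=0.3, size=(H, 1))   # input -> hidden
Wh = rng.normal(scale=0.3, size=(H, H))   # hidden -> hidden
Wy = rng.normal(scale=0.3, size=(1, H))   # hidden -> output

def stream():
    """Virtually infinite input stream: a noisy sine wave, one sample at a time."""
    t = 0
    while True:
        yield np.sin(0.1 * t) + 0.05 * rng.normal()
        t += 1

h = np.zeros((H, 1))                       # carried across segments (never reset)
samples = stream()
x_prev = next(samples)
for seg in range(200):                     # 200 segments of SEG frames each
    xs, hs, ys, targets = [], [h], [], []
    for _ in range(SEG):                   # forward pass over one segment
        x_next = next(samples)
        x = np.array([[x_prev]])
        h = np.tanh(Wx @ x + Wh @ hs[-1])
        xs.append(x); hs.append(h); ys.append(Wy @ h); targets.append(x_next)
        x_prev = x_next
    # backward pass, truncated at the segment boundary (no gradient into hs[0])
    dWx, dWh, dWy, dh_next = 0.0, 0.0, 0.0, np.zeros((H, 1))
    for t in reversed(range(SEG)):
        dy = ys[t] - targets[t]            # d(0.5 * squared error)/dy
        dWy += dy @ hs[t + 1].T
        dh = Wy.T @ dy + dh_next
        dpre = dh * (1 - hs[t + 1] ** 2)   # back through tanh
        dWx += dpre @ xs[t].T
        dWh += dpre @ hs[t].T
        dh_next = Wh.T @ dpre
    for W, dW in ((Wx, dWx), (Wh, dWh), (Wy, dWy)):
        W -= LR * dW / SEG
    h = hs[-1]                             # carry the final state into the next segment
    if seg % 50 == 0:
        mse = float(np.mean([(ys[t] - targets[t]) ** 2 for t in range(SEG)]))
        print(f"segment {seg}: mse {mse:.4f}")
```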

    FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks

    Full text link
    In this paper, a neural network based real-time speech recognition (SR) system is developed on an FPGA for very low-power operation. The implemented system employs two recurrent neural networks (RNNs): one is a speech-to-character RNN for acoustic modeling (AM), and the other is for character-level language modeling (LM). The system also employs a statistical word-level LM to improve the recognition accuracy. The outputs of the AM, the character-level LM, and the word-level LM are combined using a fairly simple N-best search algorithm instead of a hidden Markov model (HMM) based decoding network. The RNNs are implemented using massively parallel processing elements (PEs) for low latency and high throughput. The weights are quantized to 6 bits so that all of them fit in the on-chip memory of the FPGA. The proposed design is implemented on a Xilinx XC7Z045, and the system operates much faster than real time.
    Comment: Accepted to SiPS 2016.
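How the three model outputs might be combined in a simple N-best re-ranking can be pictured as a weighted sum of log-probabilities plus a word-insertion bonus. The field names, interpolation weights, and example hypotheses below are illustrative assumptions, not the actual parameters or search procedure of the FPGA system described above.

```python
# Toy N-best score combination: each hypothesis keeps separate acoustic,
# character-LM, and word-LM log-probabilities, and the list is re-ranked with
# interpolation weights and a word-insertion bonus.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_logp: float        # from the speech-to-character RNN (acoustic model)
    char_lm_logp: float   # from the character-level RNN LM
    word_lm_logp: float   # from the statistical word-level LM

def rerank(hyps, n_best=3, char_lm_w=0.5, word_lm_w=1.0, word_bonus=0.5):
    def score(h):
        return (h.am_logp
                + char_lm_w * h.char_lm_logp
                + word_lm_w * h.word_lm_logp
                + word_bonus * len(h.text.split()))
    return sorted(hyps, key=score, reverse=True)[:n_best]

hyps = [
    Hypothesis("the cat sat", am_logp=-12.1, char_lm_logp=-9.0, word_lm_logp=-6.2),
    Hypothesis("the cat sad", am_logp=-11.8, char_lm_logp=-9.4, word_lm_logp=-9.1),
    Hypothesis("thick at sat", am_logp=-12.5, char_lm_logp=-10.2, word_lm_logp=-8.7),
]
for h in rerank(hyps):
    print(h.text)      # best-scoring hypothesis first
```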